MEM T380

Case Studies Group 10

Ante Sokosa
Ziad Hatab

HW2a

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import preprocessing
# set seaborn's default settings
sns.set()

Import Data¶

We import the data into multiple dataframes by their subset for later use, but also can concatenate them for current use.

In [ ]:
excel_file = 'weld_defect_dataset.xlsx'
subsets = []

for i in range(1, 6):
    subset = pd.read_excel(excel_file, sheet_name = 'subset' + str(i))
    subset = subset.rename(columns={'Type':'type',
                                    'W':'w',
                                    'Ar':'ar',
                                    'Sp':'sp',
                                    'Re':'re',
                                    'Rr':'rr',
                                    'Sk':'sk',
                                    'Ku':'ku',
                                    'Hc':'hc',
                                    'Rc':'rc',
                                    'Sc ':'sc',
                                    'Kc ':'kc'}) #note the space after Sc and Kc are errors in naming in the excel file and are corrected here for ease of use later
    subsets.append(subset)

subsetsall = pd.concat(subsets, ignore_index=True)
subsetsall
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045
... ... ... ... ... ... ... ... ... ... ... ... ...
215 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244
216 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843
217 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012
218 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931
219 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349

220 rows × 12 columns

For Reference:

image.png

We now have indivudal dataframes for each subset of data, but we also have one large dataframe for current use / data exploration where applicable:

In [ ]:
print(subsets[0].shape)
print(subsets[4].shape)
print(subsetsall.shape)
(44, 12)
(44, 12)
(220, 12)

With .info() we see that the data is in a very clean format, with no missing values (220 non-null), and all data types are correct.

In [ ]:
subsetsall.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   type    220 non-null    object 
 1   w       220 non-null    float64
 2   ar      220 non-null    float64
 3   sp      220 non-null    float64
 4   re      220 non-null    float64
 5   rr      220 non-null    float64
 6   sk      220 non-null    float64
 7   ku      220 non-null    float64
 8   hc      220 non-null    float64
 9   rc      220 non-null    float64
 10  sc      220 non-null    float64
 11  kc      220 non-null    float64
dtypes: float64(11), object(1)
memory usage: 20.8+ KB

or we can do so with .dtypes and .isnull().sum().sum() :

In [ ]:
subsetsall.dtypes
Out[ ]:
type     object
w       float64
ar      float64
sp      float64
re      float64
rr      float64
sk      float64
ku      float64
hc      float64
rc      float64
sc      float64
kc      float64
dtype: object
In [ ]:
subsetsall.isna().sum().sum()
Out[ ]:
0

We can also use .describe() to get a quick overview of the data:

In [ ]:
subsets[0].describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000
mean 0.193312 0.219518 0.596164 0.408728 0.244100 0.190435 0.618382 0.114831 0.227141 0.222136 0.468713
std 0.220107 0.260694 0.217955 0.274365 0.223650 0.171583 0.145415 0.157291 0.209002 0.113051 0.138047
min 0.007163 0.003448 0.192400 0.004051 0.001971 0.011205 0.269329 0.001358 0.032748 0.002616 0.100178
25% 0.041189 0.014874 0.415600 0.145802 0.087699 0.080167 0.536612 0.017436 0.110421 0.142772 0.412662
50% 0.088109 0.062179 0.551800 0.492258 0.169426 0.115660 0.585224 0.038245 0.155732 0.215720 0.468566
75% 0.280444 0.393534 0.830500 0.599842 0.330571 0.243957 0.672636 0.131052 0.218242 0.292278 0.545424
max 1.000000 0.826724 0.928000 1.007173 0.998562 0.681613 1.113649 0.617477 1.001281 0.571364 0.911416
In [ ]:
subsets[4].describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000
mean 0.175860 0.220914 0.623318 0.391490 0.243808 0.190940 0.582134 0.080576 0.219547 0.225331 0.502365
std 0.188479 0.276367 0.209529 0.277337 0.203288 0.221725 0.163183 0.095121 0.180705 0.152141 0.164504
min 0.007163 0.003448 0.157200 0.002152 0.000200 0.001608 0.168895 0.002135 0.013219 0.001474 0.155346
25% 0.038682 0.009670 0.438300 0.135696 0.093246 0.074033 0.495114 0.015487 0.099992 0.131640 0.416361
50% 0.090974 0.059112 0.638000 0.437933 0.199206 0.132192 0.563511 0.035016 0.178824 0.213700 0.498491
75% 0.263968 0.362931 0.783000 0.606814 0.334925 0.158128 0.643986 0.125883 0.265546 0.309116 0.582118
max 0.816619 0.949262 1.026800 1.007173 0.817756 1.002376 1.128828 0.378949 0.757683 0.729507 0.990413
In [ ]:
subsetsall.describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000
mean 0.175905 0.207577 0.599259 0.392519 0.249057 0.167519 0.604765 0.092316 0.216053 0.240740 0.478316
std 0.192450 0.256669 0.216377 0.265337 0.208797 0.164088 0.150983 0.138605 0.173100 0.143031 0.150625
min 0.001433 0.003448 0.025200 0.000591 0.000118 0.001608 0.168895 0.000013 0.004129 0.001474 0.028573
25% 0.035817 0.011860 0.415600 0.131772 0.086298 0.065242 0.519888 0.012539 0.107777 0.145720 0.372002
50% 0.078080 0.062179 0.586600 0.412764 0.213024 0.113033 0.571244 0.033709 0.158049 0.216357 0.481886
75% 0.277937 0.362931 0.826200 0.604219 0.339437 0.198047 0.670390 0.111988 0.254074 0.317875 0.572017
max 1.000000 1.037931 1.026800 1.007173 1.003975 1.002376 1.202949 1.049198 1.001281 1.000876 1.025173

From the 3 .describe() commands, it can be seen that all subsets seem to have similar data, so for further data exploration we will use the full concatenated dataset.

In [ ]:
# original subset is 44 rows
# original sebsetsall is 220 rows (44*5)

for i in range(5):
    subsets[i].drop_duplicates(inplace=True)
    print(subsets[i].shape)

print(subsetsall.shape)
subsetsall_temp = subsetsall.copy()
subsetsall_temp.drop_duplicates(inplace=True)
print(subsetsall_temp.shape)
(44, 12)
(44, 12)
(44, 12)
(44, 12)
(44, 12)
(220, 12)
(219, 12)

We can see that no individual subset has any duplicated entires, but the concatenated dataframe does. With the precision of 4-6 decimal numbers over 11 columns in the dataframe, it is unlikely to be a true measured duplicate and more likely to be an duplicate entry. We will leave the dublicate removed from the concatenated dataframe and remove it from one of the subsets.

In [ ]:
duplicated_row = subsetsall[subsetsall.duplicated()]
duplicated_row
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
50 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178 0.004218 0.107777 0.266969 0.444385
In [ ]:
# Assuming you have already read the DataFrames into 'subsets' list
duplicated_row = subsetsall[subsetsall.duplicated()].iloc[0]

def is_same_row(row, target_row):
    return row.equals(target_row)

for i in range(5):
    # Check if each row in the DataFrame is the same as the example_row
    same_rows = subsets[i].apply(is_same_row, axis=1, args=(duplicated_row,))
    if same_rows.any():
        print("Subset " + str(i+1) + ":")
        print("Rows that are the same as the example row:")
        print(subsets[i][same_rows])
Subset 1:
Rows that are the same as the example row:
  type         w        ar      sp        re        rr        sk        ku  \
6   PO  0.015759  0.003448  0.4552  0.627426  0.056636  0.116363  0.678178   

         hc        rc        sc        kc  
6  0.004218  0.107777  0.266969  0.444385  
Subset 2:
Rows that are the same as the example row:
  type         w        ar      sp        re        rr        sk        ku  \
6   PO  0.015759  0.003448  0.4552  0.627426  0.056636  0.116363  0.678178   

         hc        rc        sc        kc  
6  0.004218  0.107777  0.266969  0.444385  

It can be seen from the above that subset1 and subset2 have their index 6 row as the same. For reasons mentioned earlier we will remove this row from one of the subsets, in this case subset2 as it came later in the data than subset1.

In [ ]:
subsets[1].head(10)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.003448 0.6420 0.416456 1.003975 0.116834 0.961553 0.003549 0.125551 0.345271 0.407876
1 PO 0.012894 0.003448 0.3784 0.235612 0.599335 0.100720 0.661515 0.003367 0.100663 0.161510 0.362336
2 PO 0.012894 0.003448 0.3784 0.235612 0.472149 0.015586 0.373776 0.003766 0.070719 0.240516 0.371707
3 PO 0.010029 0.003448 0.2152 0.979030 0.421204 0.043400 0.783220 0.003151 0.004129 0.017908 0.028573
4 PO 0.020057 0.003448 0.5600 0.527511 0.050374 0.211741 0.725096 0.008584 0.084605 0.281740 0.329137
5 PO 0.011461 0.003448 0.2996 0.773502 0.310067 0.058798 0.536005 0.001847 0.116556 0.199172 0.441565
6 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178 0.004218 0.107777 0.266969 0.444385
7 PO 0.035817 0.003448 0.4156 0.669620 0.007348 0.266460 0.738220 0.016878 0.039208 0.125001 0.242095
8 PO 0.011461 0.003448 0.3000 0.773502 0.121862 0.145167 0.534106 0.001259 0.123437 0.205165 0.502887
9 PO 0.053009 0.003448 0.4104 0.476751 0.081149 0.345283 1.202949 0.034153 0.087581 0.567489 0.475278
In [ ]:
subsets[1].drop(6, inplace=True)
print(subsets[1].shape)
(43, 12)
In [ ]:
subsets[1].head(10)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.003448 0.6420 0.416456 1.003975 0.116834 0.961553 0.003549 0.125551 0.345271 0.407876
1 PO 0.012894 0.003448 0.3784 0.235612 0.599335 0.100720 0.661515 0.003367 0.100663 0.161510 0.362336
2 PO 0.012894 0.003448 0.3784 0.235612 0.472149 0.015586 0.373776 0.003766 0.070719 0.240516 0.371707
3 PO 0.010029 0.003448 0.2152 0.979030 0.421204 0.043400 0.783220 0.003151 0.004129 0.017908 0.028573
4 PO 0.020057 0.003448 0.5600 0.527511 0.050374 0.211741 0.725096 0.008584 0.084605 0.281740 0.329137
5 PO 0.011461 0.003448 0.2996 0.773502 0.310067 0.058798 0.536005 0.001847 0.116556 0.199172 0.441565
7 PO 0.035817 0.003448 0.4156 0.669620 0.007348 0.266460 0.738220 0.016878 0.039208 0.125001 0.242095
8 PO 0.011461 0.003448 0.3000 0.773502 0.121862 0.145167 0.534106 0.001259 0.123437 0.205165 0.502887
9 PO 0.053009 0.003448 0.4104 0.476751 0.081149 0.345283 1.202949 0.034153 0.087581 0.567489 0.475278
10 SL 0.018625 0.020690 0.6496 0.744641 0.304496 0.052827 0.370475 0.007743 0.187048 0.500302 0.330965

Reset index in the two dataframes with removed rows:

In [ ]:
subsetsall = subsetsall_temp.copy()

subsetsall.reset_index(drop=True, inplace=True) # use the drop=True to avoid the old index being added as a column, and having to drop it later
subsets[1].reset_index(drop=True, inplace=True)
subsets[1].head(10)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.003448 0.6420 0.416456 1.003975 0.116834 0.961553 0.003549 0.125551 0.345271 0.407876
1 PO 0.012894 0.003448 0.3784 0.235612 0.599335 0.100720 0.661515 0.003367 0.100663 0.161510 0.362336
2 PO 0.012894 0.003448 0.3784 0.235612 0.472149 0.015586 0.373776 0.003766 0.070719 0.240516 0.371707
3 PO 0.010029 0.003448 0.2152 0.979030 0.421204 0.043400 0.783220 0.003151 0.004129 0.017908 0.028573
4 PO 0.020057 0.003448 0.5600 0.527511 0.050374 0.211741 0.725096 0.008584 0.084605 0.281740 0.329137
5 PO 0.011461 0.003448 0.2996 0.773502 0.310067 0.058798 0.536005 0.001847 0.116556 0.199172 0.441565
6 PO 0.035817 0.003448 0.4156 0.669620 0.007348 0.266460 0.738220 0.016878 0.039208 0.125001 0.242095
7 PO 0.011461 0.003448 0.3000 0.773502 0.121862 0.145167 0.534106 0.001259 0.123437 0.205165 0.502887
8 PO 0.053009 0.003448 0.4104 0.476751 0.081149 0.345283 1.202949 0.034153 0.087581 0.567489 0.475278
9 SL 0.018625 0.020690 0.6496 0.744641 0.304496 0.052827 0.370475 0.007743 0.187048 0.500302 0.330965

Use .describe() and .info() on our cleaned data (concatenated only is okay here):

In [ ]:
subsetsall.describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 0.176636 0.208510 0.599917 0.391446 0.249936 0.167753 0.604430 0.092718 0.216547 0.240620 0.478471
std 0.192585 0.256884 0.216652 0.265466 0.208867 0.164427 0.151247 0.138793 0.173341 0.143348 0.150952
min 0.001433 0.003448 0.025200 0.000591 0.000118 0.001608 0.168895 0.000013 0.004129 0.001474 0.028573
25% 0.037250 0.012357 0.415600 0.131266 0.087341 0.064804 0.519690 0.012782 0.108342 0.145656 0.371903
50% 0.078797 0.062834 0.587200 0.409072 0.214041 0.112573 0.570744 0.034153 0.158479 0.215147 0.482290
75% 0.277937 0.363793 0.826800 0.602742 0.341661 0.199904 0.669316 0.113712 0.254108 0.318871 0.572080
max 1.000000 1.037931 1.026800 1.007173 1.003975 1.002376 1.202949 1.049198 1.001281 1.000876 1.025173
In [ ]:
subsetsall.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   type    219 non-null    object 
 1   w       219 non-null    float64
 2   ar      219 non-null    float64
 3   sp      219 non-null    float64
 4   re      219 non-null    float64
 5   rr      219 non-null    float64
 6   sk      219 non-null    float64
 7   ku      219 non-null    float64
 8   hc      219 non-null    float64
 9   rc      219 non-null    float64
 10  sc      219 non-null    float64
 11  kc      219 non-null    float64
dtypes: float64(11), object(1)
memory usage: 20.7+ KB

We can also check for any previously labeled numerical data to turn into categorical data:

In [ ]:
subsetsall.nunique(axis=0) 
Out[ ]:
type      5
w       138
ar      164
sp      181
re      175
rr      219
sk      219
ku      219
hc      218
rc      219
sc      219
kc      219
dtype: int64

The only column with a low enough number on unique values to be considered categorical data is the type column, which it already is and we will leave it as so.

We can create a num list but a cat list is not necessary for this dataset as there is only one categorical column.

In [ ]:
nums = list(subsetsall.select_dtypes(exclude=['object']).columns)
nums
Out[ ]:
['w', 'ar', 'sp', 're', 'rr', 'sk', 'ku', 'hc', 'rc', 'sc', 'kc']

Visualize Data¶

PairPlot:

In [ ]:
sns.pairplot(subsetsall, vars=nums, hue='type')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x21daec13070>

It can be seen that type PO spikes heavily in some cases such as hc vs hc and ar vs ac, but also is ponounced in other cases. We will talk about this later.

Currently, we can identify some features that are useful for finding trends. From initial visual inspeciton, re and rc plotted against any other feature seem to decently split the data into their categorical types. Specifically, re vs rc seems the individual best.

However, others are better at splitting only certain types.
For example, kc vs w, ar, rr, and hc split PO and SL quite well, while ar vs rc splits LP and CR well.

A heatmap of the correlation between the numerical data can be made:

In [ ]:
sns.heatmap(subsetsall[nums].corr(), annot=True)
Out[ ]:
<Axes: >

re by far has some very good correlations with other features, nameley w, ar, and sp. rc also has some good ones.
Our visual inference was accurate.

Noimalizaiton will likley not help here as it can already be seen that the data is near normal (max near 1 and min near 0):

In [ ]:
subsetsall.describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 0.176636 0.208510 0.599917 0.391446 0.249936 0.167753 0.604430 0.092718 0.216547 0.240620 0.478471
std 0.192585 0.256884 0.216652 0.265466 0.208867 0.164427 0.151247 0.138793 0.173341 0.143348 0.150952
min 0.001433 0.003448 0.025200 0.000591 0.000118 0.001608 0.168895 0.000013 0.004129 0.001474 0.028573
25% 0.037250 0.012357 0.415600 0.131266 0.087341 0.064804 0.519690 0.012782 0.108342 0.145656 0.371903
50% 0.078797 0.062834 0.587200 0.409072 0.214041 0.112573 0.570744 0.034153 0.158479 0.215147 0.482290
75% 0.277937 0.363793 0.826800 0.602742 0.341661 0.199904 0.669316 0.113712 0.254108 0.318871 0.572080
max 1.000000 1.037931 1.026800 1.007173 1.003975 1.002376 1.202949 1.049198 1.001281 1.000876 1.025173

We can try standardization to see if it helps:

In [ ]:
subsetsall_std = subsetsall.copy()
subsetsall_std.head()
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045
In [ ]:
sc = '_std'
nums_std = []
for s in nums:
    s = s + sc
    nums_std.append(s)
print(nums_std)
['w_std', 'ar_std', 'sp_std', 're_std', 'rr_std', 'sk_std', 'ku_std', 'hc_std', 'rc_std', 'sc_std', 'kc_std']
In [ ]:
std_Scaler = preprocessing.StandardScaler()

std_Scaler.fit(subsetsall_std[nums])

subsetsall_std[nums_std] = std_Scaler.transform(subsetsall_std[nums])

subsetsall_std.drop(nums, axis=1, inplace=True)
subsetsall_std.head()
Out[ ]:
type w_std ar_std sp_std re_std rr_std sk_std ku_std hc_std rc_std sc_std kc_std
0 PO -0.874553 -0.786637 -0.116198 1.687500 3.592437 -0.462970 2.014749 -0.646807 -0.608549 0.112723 -0.587744
1 PO -0.867095 -0.800094 -0.873054 1.687500 1.916513 -0.783785 -0.847640 -0.649219 -0.550744 0.643202 -0.230305
2 PO -0.882011 -0.800094 -0.739818 2.324728 2.420339 -0.729491 1.073546 -0.650634 -0.513391 0.576477 -0.225458
3 PO -0.770168 -0.800094 -1.330129 0.540485 -0.903687 0.469655 1.223842 -0.597296 -0.716502 -0.141997 -0.921506
4 PO -0.822358 -0.800094 -0.913765 0.625398 -1.020156 2.505449 0.173118 -0.620766 -0.824734 0.211772 -1.363982
In [ ]:
subsetsall_std.describe()
Out[ ]:
w_std ar_std sp_std re_std rr_std sk_std ku_std hc_std rc_std sc_std kc_std
count 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02
mean -8.111218e-18 -3.244487e-17 1.662800e-16 -9.733462e-17 -6.083414e-18 -7.502877e-17 1.022014e-15 3.244487e-17 -4.461170e-17 9.125121e-17 -4.866731e-17
std 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00
min -9.118320e-01 -8.000944e-01 -2.658791e+00 -1.475707e+00 -1.198802e+00 -1.012763e+00 -2.886221e+00 -6.694647e-01 -1.228245e+00 -1.672119e+00 -2.987224e+00
25% -7.254280e-01 -7.653339e-01 -8.526988e-01 -9.823347e-01 -7.802472e-01 -6.275409e-01 -5.615597e-01 -5.772531e-01 -6.256668e-01 -6.639940e-01 -7.075824e-01
50% -5.091975e-01 -5.683864e-01 -5.883228e-02 6.654728e-02 -1.722487e-01 -3.363570e-01 -2.232333e-01 -4.229268e-01 -3.357616e-01 -1.781106e-01 2.536041e-02
75% 5.272102e-01 6.058740e-01 1.049620e+00 7.977667e-01 4.401597e-01 1.959837e-01 4.299905e-01 1.516094e-01 2.171876e-01 5.471275e-01 6.215435e-01
max 4.285127e+00 3.236178e+00 1.974872e+00 2.324728e+00 3.618412e+00 5.087585e+00 3.966286e+00 6.907185e+00 4.537491e+00 5.315733e+00 3.629988e+00
In [ ]:
sns.pairplot(subsetsall_std, vars=nums_std, hue='type')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x21dbe337e20>

The data still looks very similar. We will not use the stnadardized data and instead stick with the original near normal data.

Lets come back to the PO type spike from our pairplots. Lets check if it is because of many PO types in the data:

In [ ]:
sns.countplot(x='type', data=subsetsall)
Out[ ]:
<Axes: xlabel='type', ylabel='count'>

There is not an overwhelming amount of PO types in the data, so it must be truly strongly correlated as represnted by the pairplot Kernel Density Equation diagonal.

We can investigate it further with a boxplot.

In [ ]:
sns.boxplot(x='type', y='hc', data=subsetsall)
Out[ ]:
<Axes: xlabel='type', ylabel='hc'>

Low hc very likely means the weld defect is PO. This is a strong relation we can use later but must ensure we do not overlook other features, as other types also have some low hc values.

More plots of the data can be made to try to spot anything not yet clear if questions arise in later analysis.

Additionals:¶

One thing we might want to do for later is binarize the type column so we can use it in our models:

In [ ]:
types = subsetsall['type'].unique()
types = list(types)
types
Out[ ]:
['PO', 'SL', 'LP', 'LF', 'CR']
In [ ]:
from sklearn.preprocessing import label_binarize

type_num = label_binarize(subsetsall.type, classes=types)
print(type_num)
[[1 0 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]
 ...
 [0 0 0 0 1]
 [0 0 0 0 1]
 [0 0 0 0 1]]
In [ ]:
for i in range(len(types)):
    subsetsall[types[i]] = type_num[:,i]

subsetsall
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244 0 0 0 0 1
215 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843 0 0 0 0 1
216 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012 0 0 0 0 1
217 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931 0 0 0 0 1
218 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349 0 0 0 0 1

219 rows × 17 columns

In [ ]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # Create label encoder

encoded_subsetsall = []

for i in range(5):
    subsetsall['type_num'] = le.fit_transform(subsetsall['type']) # encode type column

subsetsall.head(40)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR type_num
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0 3
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0 3
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0 3
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0 3
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0 3
5 PO 0.011461 0.003448 0.2996 0.773502 0.133474 0.243676 0.452340 0.001358 0.090320 0.259598 0.482290 1 0 0 0 0 3
6 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178 0.004218 0.107777 0.266969 0.444385 1 0 0 0 0 3
7 PO 0.027221 0.003448 0.4156 0.557089 0.101013 0.093192 0.939251 0.008386 0.070805 0.002616 0.368677 1 0 0 0 0 3
8 PO 0.030086 0.003448 0.4248 0.513840 0.001971 0.358502 0.653443 0.014692 0.032748 0.172884 0.287875 1 0 0 0 0 3
9 PO 0.035817 0.003448 0.4156 0.669620 0.004285 0.681613 0.451320 0.017031 0.038732 0.211128 0.444937 1 0 0 0 0 3
10 SL 0.050143 0.034648 0.5228 0.516920 0.177013 0.063348 0.494043 0.019171 0.197270 0.301136 0.565477 0 1 0 0 0 4
11 SL 0.025788 0.015617 0.3340 0.471477 0.276169 0.063095 0.507786 0.005105 0.141924 0.245195 0.542960 0 1 0 0 0 4
12 SL 0.063037 0.081610 0.4756 0.551477 0.400952 0.085216 0.625463 0.026221 0.226071 0.105925 0.592413 0 1 0 0 0 4
13 SL 0.067335 0.058621 0.5132 0.611181 0.390934 0.082072 0.670788 0.028746 0.174749 0.118310 0.519074 0 1 0 0 0 4
14 SL 0.042980 0.021438 0.5288 0.766076 0.346107 0.140695 0.653494 0.017571 0.150256 0.382587 0.420232 0 1 0 0 0 4
15 SL 0.138968 0.115517 0.5192 0.513840 0.053111 0.232464 0.481685 0.214936 0.302905 0.081632 0.646642 0 1 0 0 0 4
16 SL 0.147564 0.023731 0.2656 0.884430 0.039914 0.596128 0.654200 0.109447 0.097271 0.268753 0.100178 0 1 0 0 0 4
17 SL 0.074499 0.012645 0.1924 0.631139 0.099007 0.292338 0.753337 0.041720 0.102973 0.206860 0.436098 0 1 0 0 0 4
18 SL 0.074499 0.044562 0.4872 0.599283 0.089800 0.320126 0.538374 0.027256 0.125914 0.240262 0.627607 0 1 0 0 0 4
19 SL 0.065903 0.061524 0.6596 0.513840 0.161838 0.643830 0.269329 0.034770 0.158479 0.210568 0.911416 0 1 0 0 0 4
20 LP 0.415473 0.431379 0.8668 0.068861 0.125231 0.168406 0.767691 0.088235 0.193620 0.131252 0.242035 0 0 1 0 0 2
21 LP 0.613181 0.324466 0.7932 0.253713 0.030618 0.114958 0.715633 0.483561 0.106063 0.041874 0.302604 0 0 1 0 0 2
22 LP 0.253582 0.417241 0.8664 0.110422 0.329608 0.069216 0.537527 0.060204 0.208140 0.010399 0.480510 0 0 1 0 0 2
23 LP 0.187679 0.359769 0.8320 0.037932 0.318771 0.116448 1.113649 0.017783 0.156013 0.068120 0.511571 0 0 1 0 0 2
24 LP 0.550143 0.242717 0.8284 0.087806 0.068943 0.479354 0.693576 0.411346 0.105377 0.289325 0.250541 0 0 1 0 0 2
25 LP 1.000000 0.642338 0.8376 0.112152 0.134046 0.106603 0.670257 0.617477 0.132390 0.144749 0.343697 0 0 1 0 0 2
26 LP 0.777937 0.493869 0.8384 0.046878 0.144770 0.091868 0.657572 0.266960 0.134514 0.331065 0.354818 0 0 1 0 0 2
27 LP 0.246418 0.308045 0.8300 0.004051 0.088274 0.344919 0.551920 0.050698 0.145350 0.306394 0.423902 0 0 1 0 0 2
28 LP 0.339542 0.289921 0.8372 0.102152 0.053839 0.221241 0.570290 0.172724 0.166593 0.162887 0.479350 0 0 1 0 0 2
29 LP 0.343840 0.391379 0.8336 0.015190 0.085972 0.160149 0.595178 0.117161 0.181476 0.339861 0.552816 0 0 1 0 0 2
30 LF 0.402579 0.221838 0.6692 0.209451 0.103607 0.246070 0.577897 0.417545 0.247450 0.204585 0.438135 0 0 0 1 0 1
31 LF 0.063037 0.062834 0.2300 0.304515 0.299782 0.203618 0.640899 0.019800 0.181293 0.189060 0.498422 0 0 0 1 0 1
32 LF 0.206304 0.111686 0.7932 0.353038 0.130069 0.098141 0.547505 0.227999 0.333604 0.163009 0.631823 0 0 0 1 0 1
33 LF 0.071633 0.055172 0.5780 0.409072 0.196221 0.068892 0.592551 0.029966 0.215632 0.344792 0.457783 0 0 0 1 0 1
34 LF 0.075931 0.027790 0.4628 0.571181 0.333461 0.044079 0.570744 0.052599 0.161889 0.131012 0.480531 0 0 0 1 0 1
35 LF 0.073066 0.022607 0.3972 0.601519 0.224584 0.211671 0.751020 0.061937 0.114510 0.136843 0.454621 0 0 0 1 0 1
36 LF 0.110315 0.054652 0.3364 0.513038 0.258901 0.038028 0.499730 0.071062 0.155450 0.101629 0.489576 0 0 0 1 0 1
37 CR 0.100287 0.400000 0.7424 0.152025 0.516416 0.137537 0.555328 0.022585 0.432063 0.414954 0.528984 0 0 0 0 1 0
38 CR 0.209169 0.547510 0.8400 0.127131 0.405834 0.074451 0.543629 0.034410 0.347480 0.284818 0.439879 0 0 0 0 1 0
39 CR 0.277937 0.826724 0.8816 0.085738 0.704132 0.011205 0.486398 0.104069 0.731836 0.149571 0.529300 0 0 0 0 1 0

Classification¶

As explored in data preprocessing and visualziation we will use the re and rc features to classify the data.

Lets review this indivudal scatterplot:

In [ ]:
sns.scatterplot(data=subsetsall, x='re', y='rc', hue='type')

plt.show()

Upon inspection of this plot, it is best to use the following 3 types as our targets: CR, LP, SL. (CR and LP were noted earlier in data preprocessing and visualization, but we will add SL as well due to what we can see more closely in this blowup of the plot.)

More inferences can also be made:

  • It will be easy to make a decision boundary between LP and SL but harder to make a decision with a KNN approach here as the spread over the re axis is larger for SL than LP.

  • For the split between CR and LP, it seems to be the opposite case. A KNN approach might produce better results due to the closely clumped data points of CR near the fuzzy border and decently clumped LP data points slightly fartehr from the border.

Also, before we move on to classification, we will do a quick cleanup / restructure of the data to only contain the features and targets we want as stated above. (We will also leave the data in its subsets as to use these as splits for training and testing.) (Also, we encode the type column for later use.)

In [ ]:
encoded_subsets = []

for i in range(5):
    subsets[i] = subsets[i][['type','re', 'rc']] # only keep the columns we want (by features + target column)
    subsets[i] = subsets[i][subsets[i]['type'].isin(['CR','LP','SL'])] # only keep the rows we want (by type)
    subsets[i].reset_index(drop=True, inplace=True) # reset index
    subsets[i]['type_num'] = le.fit_transform(subsets[i]['type']) # encode type column
    print(subsets[i].shape)
(27, 4)
(27, 4)
(27, 4)
(27, 4)
(27, 4)

The total of these types in each subset was equivalet, that's convenient for cross validation later.

One of the new subsets:

In [ ]:
subsets[4]
Out[ ]:
type re rc type_num
0 SL 0.725865 0.218229 2
1 SL 0.638143 0.212740 2
2 SL 0.504768 0.146195 2
3 SL 0.316709 0.188856 2
4 SL 0.870253 0.217319 2
5 SL 0.730675 0.095278 2
6 SL 0.606329 0.179129 2
7 SL 0.622489 0.101563 2
8 SL 0.771772 0.092916 2
9 SL 0.585232 0.160975 2
10 LP 0.091646 0.182427 1
11 LP 0.162194 0.074074 1
12 LP 0.115063 0.233013 1
13 LP 0.153840 0.207985 1
14 LP 0.185274 0.132899 1
15 LP 0.003038 0.092287 1
16 LP 0.100253 0.130390 1
17 LP 0.002152 0.204965 1
18 LP 0.042278 0.178519 1
19 LP 0.006034 0.292241 1
20 CR 0.142574 0.308130 0
21 CR 0.071941 0.398472 0
22 CR 0.102869 0.757683 0
23 CR 0.055527 0.262126 0
24 CR 0.106793 0.757547 0
25 CR 0.143629 0.640716 0
26 CR 0.167046 0.637409 0

KNN¶

Now we can make our KNN model.

Create training and testing data from the subsets:

In [ ]:
X = []
y = []
for i in range(5):
    X.append(subsets[i][['re', 'rc']].values)
    y.append(subsets[i]['type_num'].values)
print(X[4].shape)
print(y[4].shape)

# Training Data - subsets 1-4 (80% of data)
X_train = np.concatenate(X[:4], axis=0)
y_train = np.concatenate(y[:4], axis=0)
print(X_train.shape)
print(y_train.shape)

# Training Data - subset 5 (20% of data)
X_test = X[4]
y_test = y[4]
print(X_test.shape)
print(y_test.shape)
(27, 2)
(27,)
(108, 2)
(108,)
(27, 2)
(27,)

Jointplot for more visualizaiton before we begin:

In [ ]:
sns.jointplot(x='re', y='rc', data=subsets[4], hue='type_num', kind='scatter')
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x21dc749eda0>

Fit KNN model:

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

k = 5
knn_model = KNeighborsClassifier(n_neighbors=k)

knn_model.fit(X_train, y_train)
Out[ ]:
KNeighborsClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier()

Plot the decision boundary of the model:

In [ ]:
#---min and max for the first feature---
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1

#---min and max for the second feature---
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1

#---step size in the mesh---
x_step = (x_max - x_min) / 100
y_step = (y_max - y_min) / 100

#---make predictions for each of the points in xx,yy---
xx, yy = np.meshgrid(np.arange(x_min, x_max, x_step), np.arange(y_min, y_max, y_step))

Z = knn_model.predict(np.c_[xx.ravel(), yy.ravel()])

#---draw the result using a color plot---
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Accent, alpha=0.8)

#---plot the training points---
colors = ['red', 'green', 'blue']
types = ['CR', 'LP', 'SL']
for color, i, target in zip(colors, [0, 1, 2], types):
    plt.scatter(X_train[y_train==i, 0], X_train[y_train==i, 1], color=color, label=target)

plt.xlabel('Roughness of Defect Edge (re)')
plt.ylabel('Roughness Contrast (rc)')
plt.title(f'Decision Surface for KNN model with (k={k})')
plt.legend(loc='best', shadow=False, scatterpoints=1)
Out[ ]:
<matplotlib.legend.Legend at 0x21dc70dbbe0>

It seems like there may be some slight overfitting here, but it is not too bad. We will see how it performs.

Predict weld defect types using testing data:

In [ ]:
y_pred = knn_model.predict(X_test)
print(y_pred)
[2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]
In [ ]:
from sklearn.metrics import confusion_matrix

mat_test = confusion_matrix(y_test, y_pred)
print('confusion matrix = \n', mat_test)
confusion matrix = 
 [[7 0 0]
 [1 9 0]
 [0 1 9]]
In [ ]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, square=True, xticklabels=types, yticklabels=types)
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Actual Labels')
Out[ ]:
Text(17.25, 0.5, 'Actual Labels')

Some precursors to confusion matrix calculations:

In [ ]:
# True Positive (TP) = diagonal elements
CR_TP = mat_test[0,0]
LP_TP = mat_test[1,1]
SL_TP = mat_test[2,2]
print(CR_TP, LP_TP, SL_TP) 

# False Negative (FN) = sum of row - TP
CR_FN = sum(mat_test[0])-CR_TP
LP_FN = sum(mat_test[1])-LP_TP
SL_FN = sum(mat_test[2])-SL_TP
print(CR_FN, LP_FN, SL_FN)

# False Positive (FP) = sum of column - TP
CR_FP = sum(mat_test[:,0])-CR_TP
LP_FP = sum(mat_test[:,1])-LP_TP
SL_FP = sum(mat_test[:,2])-SL_TP
print(CR_FP, LP_FP, SL_FP)
7 9 9
0 1 1
1 1 0

The True Positive Rate (or Recall or Sensitivity) can be calculated using the formula:

image-2.png

In [ ]:
CR_TPR = CR_TP/(CR_TP+CR_FN)
LP_TPR = LP_TP/(LP_TP+LP_FN)
SL_TPR = SL_TP/(SL_TP+SL_FN)
print(CR_TPR, LP_TPR, SL_TPR)
1.0 0.9 0.9

The Positive Predictive Rate (or Precision) can be calculated using the formula:

image.png

In [ ]:
CR_PPR = CR_TP/(CR_TP+CR_FP)
LP_PPR = LP_TP/(LP_TP+LP_FP)
SL_PPR = SL_TP/(SL_TP+SL_FP)
print(CR_PPR, LP_PPR, SL_PPR)
0.875 0.9 1.0

Final Accuracy can be calulated using:

image.png

In [ ]:
PPR = (CR_TP + LP_TP + SL_TP)/sum(sum(mat_test))
print(PPR)
0.9259259259259259

Let's verify our manual calculations with the classification_report function:

In [ ]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=types))
              precision    recall  f1-score   support

          CR       0.88      1.00      0.93         7
          LP       0.90      0.90      0.90        10
          SL       1.00      0.90      0.95        10

    accuracy                           0.93        27
   macro avg       0.92      0.93      0.93        27
weighted avg       0.93      0.93      0.93        27

An ~93% accurate model is not bad here. CR is 100% accurately predicted, most likely largely due to the large clump of its datapoints near its border with LP. Actual LP and SL points each have 1 as misclassified. The LP misclassification has its datapoint over in the CR region while SL has its datapoint in the LP region. This was expected when first visualizing the data in the singular re vs rc scatterplot earlier. However, the accuracy turned out greater than expected.

Finding best K value¶

Our model works well but it is always good to check if we can improve it. We can use a for loop to find the best K value for our model:

In [ ]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

ac_scores = []
k_neighbors = list(range(1,21))
k_neighbors = [k for k in k_neighbors if k % 3 != 0] # remove multiples of 3 to avoid ties

for k in k_neighbors: 
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f'k={k}: {f1*100:0.2f}%')
    # print(classification_report(y_test, y_pred, target_names=types))

    ac_score = accuracy_score(y_test, y_pred)
    ac_scores.append(ac_score)
k=1: 92.59%
k=2: 89.06%
k=4: 92.62%
k=5: 92.62%
k=7: 92.69%
k=8: 89.06%
k=10: 92.69%
k=11: 92.69%
k=13: 92.69%
k=14: 92.69%
k=16: 92.69%
k=17: 92.69%
k=19: 92.69%
k=20: 92.69%

It seems that our original and default k value of 5 was very good for our model. However, there is a slightly higher percentage when going to k=10 and above. That slight increase in accuracy may not be worth the compuational power in some other cases but here it does not affect our research usage so we can change our k to 10 or so if we were to predict further or perform k-fold cross validation. Higher k values also reduced risk of overfitting.

Misclassification Error¶

Looking into and plotting the Misclassification Error (MSE):

In [ ]:
# changing to misclassification error:
MSE = [1 - x for x in ac_scores]

# determining best k:
optimal_k = k_neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
The optimal number of neighbors is 1
In [ ]:
 # plot misclassification error vs k:
plt.plot(k_neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

We can see that at 1 neighbor, the error is lowest. However, this would be heavily overfit. Our original k value of 5 can be good but in between two peaks of error, this may not be the best choice. As stated before we even looked at MSE, we should go for a higher k. Perhaps 11 or 13 would be a comfortable choice.

K-Fold Cross Validation Using Entire Dataset¶

Instead of just using 2 features and 3 weld defect types, we can use the entire dataset with all features and all 5 weld defect types to see how this model will do. We will do this across multiple folds as well.

In [ ]:
subsetsall # our combination of subsets as one large dataset (with encoded type columns, and 1 duplicate row removed and  as discussed earlier)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR type_num
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0 3
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0 3
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0 3
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0 3
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244 0 0 0 0 1 0
215 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843 0 0 0 0 1 0
216 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012 0 0 0 0 1 0
217 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931 0 0 0 0 1 0
218 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349 0 0 0 0 1 0

219 rows × 18 columns

In [ ]:
X, y = subsetsall[nums].values, subsetsall['type'].values
print(X.shape)
print(y.shape)
(219, 11)
(219,)
In [ ]:
from sklearn.model_selection import cross_val_score

#---holds the cv (cross-validates) scores---
cv_scores = []

#---number of folds---
folds = 10

#---creating odd list of K for KNN---
ks = list(range(1, int(len(X) * ((folds - 1)/folds))))

# ---remove all multiples of 5 as this is a 5 class problem and we want to avoid ties---
ks = [k for k in ks if k % 5 != 0]

#---perform k-fold cross validation---
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    #---performs cross-validation and returns the average accuracy---
    scores = cross_val_score(knn, X, y, cv=folds, scoring='accuracy')
    mean = scores.mean()
    cv_scores.append(mean)
    print(k, mean)
1 0.8123376623376621
2 0.79004329004329
3 0.7898268398268399
4 0.8175324675324676
6 0.8220779220779221
7 0.8220779220779221
8 0.8359307359307359
9 0.8177489177489177
11 0.7902597402597402
12 0.8043290043290042
13 0.7993506493506494
14 0.7904761904761906
16 0.7902597402597402
17 0.7764069264069263
18 0.7809523809523811
19 0.7945887445887446
21 0.7673160173160173
22 0.7673160173160174
23 0.7627705627705628
24 0.7718614718614718
26 0.7627705627705628
27 0.7675324675324675
28 0.7629870129870129
29 0.7766233766233765
31 0.7629870129870129
32 0.772077922077922
33 0.7675324675324674
34 0.7448051948051948
36 0.7629870129870129
37 0.7445887445887445
38 0.7536796536796537
39 0.7627705627705628
41 0.7627705627705628
42 0.7536796536796536
43 0.7582251082251082
44 0.7584415584415585
46 0.7584415584415585
47 0.762987012987013
48 0.7443722943722944
49 0.758008658008658
51 0.7625541125541125
52 0.758008658008658
53 0.7534632034632034
54 0.758008658008658
56 0.7489177489177489
57 0.7352813852813853
58 0.7352813852813853
59 0.7214285714285714
61 0.7032467532467532
62 0.6941558441558442
63 0.7032467532467533
64 0.7077922077922079
66 0.6941558441558441
67 0.675974025974026
68 0.675974025974026
69 0.666883116883117
71 0.6712121212121211
72 0.6530303030303031
73 0.6393939393939394
74 0.6300865800865803
76 0.6209956709956709
77 0.6164502164502164
78 0.607142857142857
79 0.5935064935064934
81 0.5935064935064934
82 0.598051948051948
83 0.5889610389610389
84 0.5889610389610389
86 0.5798701298701298
87 0.5932900432900432
88 0.5841991341991342
89 0.5839826839826839
91 0.5703463203463203
92 0.5748917748917749
93 0.5658008658008657
94 0.5567099567099567
96 0.566017316017316
97 0.5478354978354978
98 0.5432900432900433
99 0.5387445887445887
101 0.5432900432900433
102 0.5387445887445887
103 0.5432900432900433
104 0.5341991341991341
106 0.5203463203463203
107 0.5294372294372294
108 0.533982683982684
109 0.5158008658008658
111 0.5203463203463203
112 0.5203463203463203
113 0.5203463203463203
114 0.5203463203463203
116 0.5021645021645021
117 0.4976190476190476
118 0.4976190476190476
119 0.4976190476190476
121 0.48398268398268396
122 0.48852813852813853
123 0.4930735930735931
124 0.4930735930735931
126 0.48852813852813853
127 0.48398268398268396
128 0.47943722943722944
129 0.48398268398268396
131 0.48398268398268396
132 0.48852813852813853
133 0.4930735930735931
134 0.47943722943722944
136 0.47943722943722944
137 0.47943722943722944
138 0.48852813852813853
139 0.47943722943722944
141 0.4930735930735931
142 0.48852813852813853
143 0.474891774891775
144 0.474891774891775
146 0.47489177489177486
147 0.47489177489177486
148 0.474891774891775
149 0.474891774891775
151 0.4703463203463204
152 0.46580086580086577
153 0.4703463203463204
154 0.474891774891775
156 0.48398268398268396
157 0.4930735930735931
158 0.5112554112554113
159 0.5021645021645021
161 0.49783549783549785
162 0.4932900432900433
163 0.49350649350649345
164 0.507142857142857
166 0.498051948051948
167 0.4932900432900433
168 0.4796536796536796
169 0.47510822510822515
171 0.48441558441558435
172 0.4796536796536796
173 0.4796536796536796
174 0.4705627705627705
176 0.4705627705627705
177 0.46580086580086577
178 0.461038961038961
179 0.45649350649350656
181 0.45649350649350656
182 0.45649350649350656
183 0.45649350649350656
184 0.46580086580086577
186 0.451948051948052
187 0.4428571428571429
188 0.4292207792207792
189 0.4337662337662337
191 0.42467532467532465
192 0.42012987012987013
193 0.41558441558441556
194 0.4012987012987013
196 0.32835497835497834
In [ ]:
#---calculate misclassification error for each k---
MSE = [1 - x for x in cv_scores]

#---determining best k (min. MSE)---
optimal_k = ks[MSE.index(min(MSE))]
print(f"The optimal number of neighbors is {optimal_k}")

#---plot misclassification error vs k---
plt.plot(ks, MSE)
plt.plot(optimal_k, MSE[optimal_k-1], 'r', marker='*', label='optimal k')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error (MSE)')
plt.legend()
plt.show()
The optimal number of neighbors is 8
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(175, 11)
(175,)
(44, 11)
(44,)
In [ ]:
k = optimal_k
knn_model = KNeighborsClassifier(n_neighbors=k)

knn_model.fit(X_train, y_train)
Out[ ]:
KNeighborsClassifier(n_neighbors=8)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(n_neighbors=8)
In [ ]:
y_pred = knn_model.predict(X_test)
In [ ]:
mat_test = confusion_matrix(y_test, y_pred)
print('confusion matrix = \n', mat_test)
confusion matrix = 
 [[ 6  0  0  0  0]
 [ 0  6  0  1  1]
 [ 0  0 11  0  0]
 [ 0  1  0  8  1]
 [ 0  4  0  2  3]]
In [ ]:
types = subsetsall['type'].unique()
types = list(types)

fig, ax2 = plt.subplots(1, 1, figsize=(4, 4))
cm = confusion_matrix(y_test, y_pred)
ax2 = sns.heatmap(cm, annot=True, square=True, xticklabels=types, yticklabels=types)
ax2.set_xlabel('Predicted Labels')
ax2.set_ylabel('Actual Labels')
Out[ ]:
Text(17.25, 0.5, 'Actual Labels')
In [ ]:
print(classification_report(y_test, y_pred, target_names=types))
              precision    recall  f1-score   support

          PO       1.00      1.00      1.00         6
          SL       0.55      0.75      0.63         8
          LP       1.00      1.00      1.00        11
          LF       0.73      0.80      0.76        10
          CR       0.60      0.33      0.43         9

    accuracy                           0.77        44
   macro avg       0.77      0.78      0.76        44
weighted avg       0.77      0.77      0.76        44

The overall accuracy is lower here than when using only 2 features and 3 weld defect types with k=5 neighbors. If you are knowledgeable in the machine learning space, this can be expected as the model is more complex and has more possibly useless or detrimental data to work with. (However, it is still a decent accuracy and we can see that the model is not overfitting due to k fold cross validation.)

We can use this model for predicting PO and LP weld defect types as they have a perfect f1-score [at least in this random split of the data]. Also, this possibly could have predicted much earlier by looking at theses two types' density in the inital pairplot (namely ar vs ar and sp vs sp, but high in others as well).
More curated features (such as in our first KNN explored here) and perhaps less type possibllities (by first filtering out PO and LP for example) would be better for predicting the other types.

Mainly, this exploration of the data (first with a curated selection and then a full set) shows the importance of tuning the model with the best features to show the best type splits. It is not always best to use all features and all types. You can't just throw them all in and expect the best results. Muliple models should be made and used in conjunction with each other to get the best prediction results, in this weld defect dataset and in general.